Biological Pattern Discovery with R Machine Learning Approaches (Zheng Rong Yang)

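\[
(n, m) = \operatorname*{argmin}_{i,j} \, (\mathbf{x}_i - \mathbf{x}_j)^2, \quad \forall\, \mathbf{x}_i, \mathbf{x}_j \in \mathcal{D} \tag{2.18}
\]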

Here argmin stands for minimising the expression $(\mathbf{x}_i - \mathbf{x}_j)^2$ by varying its arguments, which are $i$ and $j$. The calculation returns the pair of indexes $(n, m)$ of the two data points $\mathbf{x}_n$ and $\mathbf{x}_m$ between which the distance is the least. The notation $\forall$ means 'for all'.
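The search in Equation (2.18) can be sketched in a few lines of R. This is a minimal illustration, not the book's own code: the function name `closest_pair` is hypothetical, the squared difference is read as a squared Euclidean distance when the data points are vectors, and pairs with $i = j$ are assumed to be excluded (otherwise the minimum would always be zero).

```r
# Return the indexes (n, m) of the two closest data points.
# X is a numeric vector (one value per point) or a matrix (one row per point).
closest_pair <- function(X) {
  # Squared Euclidean distances between all pairs of data points.
  sq <- as.matrix(dist(X))^2
  # Assumption: a point is never paired with itself, so mask the diagonal.
  diag(sq) <- Inf
  # which.min() gives the first minimum; arrayInd() maps it to (row, column).
  idx <- arrayInd(which.min(sq), dim(sq))
  sort(as.vector(idx))
}

D <- c(1, 5, 2, 9)
closest_pair(D)  # 1 3, because (1 - 2)^2 = 1 is the smallest squared difference
```

When several pairs are equally close, this sketch simply returns the first one that `which.min()` encounters; how ties should be broken is not specified in the text.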

Once the two data points ($\mathbf{x}_n$ and $\mathbf{x}_m$) that satisfy the above condition have been selected, they are merged and removed from $\mathcal{D}$. In addition, the median (or the mean) of them is inserted into $\mathcal{D}$. This mean or median is called a meta data point. The operation for every merge is shown below.

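\[
\begin{aligned}
\mathcal{D} &= \mathcal{D} \setminus (\mathbf{x}_n, \mathbf{x}_m) \\
\mathcal{D} &= \mathcal{D} \cup \omega_{nm}
\end{aligned} \tag{2.19}
\]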

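Here $\setminus$ is an operator called set minus, which removes data from a set, and $\cup$ is an operator called set union, which adds new data to a set. $\omega_{nm}$ is the meta data point, the mean of $\mathbf{x}_n$ and $\mathbf{x}_m$. The size of $\mathcal{D}$ is reduced by one after each merge. For instance, if $\mathcal{D} = (1, 2, 3, 4)$, then $\mathcal{D} \setminus (2, 3)$ removes 2 and 3 from $\mathcal{D}$, leading to a new set $\mathcal{D} = (1, 4)$. Moreover, $\mathcal{D} = \mathcal{D} \cup 2.5$ adds 2.5 to $\mathcal{D}$, resulting in $\mathcal{D} = (1, 4, 2.5)$. Note that 2.5 is the meta data point $\omega_{(2,3)}$ of data points 2 and 3. This process continues until $\mathcal{D}$ contains only one data point.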
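The merge operation and its repetition can be sketched in R as follows. This is an illustrative sketch under stated assumptions rather than the book's implementation: the helper names `merge_pair`, `closest_pair` and `merge_until_one` are hypothetical, the data points are single numbers held in a vector, the meta data point is taken to be the mean, and ties between equally close pairs are broken by taking the first pair found.

```r
# One merge step: remove points n and m from D, then add their mean.
merge_pair <- function(D, n, m) {
  omega <- mean(D[c(n, m)])   # meta data point of the merged pair
  c(D[-c(n, m)], omega)       # set minus, followed by set union
}

# Indexes of the two closest points (pairs with i == j are excluded).
closest_pair <- function(D) {
  sq <- as.matrix(dist(D))^2
  diag(sq) <- Inf
  sort(as.vector(arrayInd(which.min(sq), dim(sq))))
}

# Repeat the merge until only one data point is left.
merge_until_one <- function(D) {
  while (length(D) > 1) {
    nm <- closest_pair(D)
    D <- merge_pair(D, nm[1], nm[2])
  }
  D
}

# The worked example from the text: merging 2 and 3 in (1, 2, 3, 4).
merge_pair(c(1, 2, 3, 4), 2, 3)  # 1.0 4.0 2.5

# A full run on a small data set without ties.
merge_until_one(c(1, 2, 4, 8))   # 5.375
```

Each pass through the loop shortens the vector by one element, so a data set of size $N$ is reduced to a single point after $N - 1$ merges.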

In order to show how the hierarchical clustering algorithm works, the 20 amino acids are used as an example. To study protein sequence data, the amino acids must first be encoded as numerical data, because most machine learning algorithms only accept numerical data as input. There has been a long history of investigating different descriptors to encode the amino acids as numerical data [Kidera, et al., …in, et al., 2007; Lin, et al., 2008; Fontaine, et al., 2019]. Table 2.2 shows one descriptor system for the 20 amino acids, by which each amino acid is encoded by three descriptors [Lin, et al., 2008].
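To make the idea of descriptor encoding concrete, the following R sketch looks up three numbers per amino acid and concatenates them for a short sequence. It does not reproduce Table 2.2: the descriptor values below are random placeholders with no biological meaning, and the names `descriptors`, `d1` to `d3` and `encode_sequence` are hypothetical.

```r
# The 20 standard amino acids as one-letter codes.
aa <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]

# Placeholder descriptor table: 20 rows (amino acids) by 3 columns (descriptors).
# The numbers are random stand-ins, NOT the published descriptor values.
set.seed(1)
descriptors <- matrix(runif(20 * 3), nrow = 20,
                      dimnames = list(aa, c("d1", "d2", "d3")))

# Encode a sequence by looking up the descriptor row of each residue and
# concatenating the rows into one numeric vector.
encode_sequence <- function(sequence, table) {
  residues <- strsplit(sequence, "")[[1]]
  stopifnot(all(residues %in% rownames(table)))
  as.vector(t(table[residues, , drop = FALSE]))
}

x <- encode_sequence("ACDK", descriptors)
length(x)  # 12: four residues times three descriptors
```

With a real descriptor table in place of the placeholders, each amino acid becomes a point in a three-dimensional space, which is the kind of numerical input the clustering procedure above operates on.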